Abstract: In previous work, an ordering result was given 
for the symbolwise probability of error using general Markov 
channels, under iterative decoding of LDPC codes. In this paper, 
the ordering result is extended to mutual information, under the 
assumption of an iid input distribution. For certain channels, 
in which the capacity-achieving input distribution is iid, this 
allows ordering of the channels by capacity. The complexity of 
analyzing general Markov channels is mitigated by this ordering, 
since it is possible to immediately determine that a wide class of 
channels, with different numbers of states, has a smaller mutual 
information than a given channel. 

I. Introduction 

A finite-state Markov channel is a channel with binary in- 
puts, where the instantaneous values of the channel parameters 
are selected by the state of a hidden Markov chain. Capacity 
and coding were originally studied for these channels in [1]. 

An ordering of communication channels may be accom- 
plished with respect to probability of error (for a specified 
code), or with respect to channel capacity. For instance, all else 
being equal, the Gaussian channel is ordered with respect to 
noise variance: higher noise variance means higher probability 
of error for any code, as well as lower capacity. Such orderings 
are attractive to researchers, since a capacity or probability of 
error result in one channel can be immediately extended and 
applied to other channels that are covered by the ordering. The 
problem of ordering communication channels can be traced 
back to Shannon [2], where a partial ordering was given for 
memoryless channels using general codes. 

Ordering results are particularly attractive for the analysis of 
Markov channels because of the large size of their parameter 
space: $O(k^2)$ parameters for a channel with $k$ states. For 
example, if the mutual information using some channel c 
is known, it would be helpful for c to cast a "shadow" 
of neighboring channels where the mutual information was 
known to be smaller (or larger). In previous work [3], [4], we 
obtained ordering results for general Markov channels with 
respect to symbol error under iterative decoding of LDPC 
codes, and a key feature of that work was the ability to 
compare channels with different numbers of states. 

The contribution of the present paper is to generalize the 
ordering result from [3], [4] to mutual information under an 
iid input distribution. This restriction is used for three reasons: 

• Mutual information under an iid input distribution is by 
far the most practically interesting case for communications 
engineers; 

• Under some circumstances, an iid input distribution is 
capacity-achieving [1], [5]; and 

• It is very difficult to analyze Markov channels under 
non-iid input distributions. 

The iid input distribution makes our results particularly ap- 
plicable to the achievable rates of contemporary error-control 
codes, whose codewords are generally considered to simulate 
iid input distributions. Furthermore, the result is more general 
than previous work, applying to the ultimate limits of any possible 
code whose codewords satisfy the iid input distribution, 
rather than being relevant only to LDPC codes. 

To adapt these ordering results to mutual information, 
different theoretical machinery is required. This is mostly 
because the ordering in [3], [4] was based on symbol error, 
but the mutual information is related to block error. Using 
symbol error, our approach was to add or delete certain 
"side information" until the desired structure of the decoder 
was achieved. In this paper, we start out by using a similar 
approach (including a proof technique initially used in [5]), 
although a completely different method is required to prove 
its relevance to the present ordering. Furthermore, to take 
the global nature of block errors into account, we introduce 
a lemma (Lemma 1), related to the mutual information of 
channels with piecewise-Markov segments. 

The remainder of the paper is organized as follows. In 
Section II, we describe our model for general finite-state 
Markov channels. In Section III, we describe a mixing operator 
(previously introduced in [3], [4]), which allows us to flexibly 
construct degraded channels with larger numbers of states than 
the original channels. Finally, in Section IV, we state and prove 
our main result, and give some discussion concerning its use. 

II. Model 

In this paper, we will write constant scalars and vectors as 
$x$ and $\mathbf{x}$, respectively; and scalar and vector random variables 
as $X$ and $\mathbf{X}$, respectively. For a random variable $X$, a 
realization of the random variable will usually be written as 
the corresponding lower-case letter $x$. We also use bold upper-case 
letters to represent constant matrices, but it should be 
clear from the context when we mean a matrix and when 
we mean a vector random variable. Finally, for probability 
density functions (PDFs), such as $f_X(x)$, and probability mass 
functions (PMFs), such as $p_X(x)$, we will omit the subscript 
when it is unambiguous to do so, and simply write $f(x)$ and 
$p(x)$ for PDFs and PMFs, respectively. 

Consider a channel $c$ with inputs selected from an alphabet 
$\mathcal{X}$, outputs selected from an alphabet $\mathcal{Y}$, and hidden channel 
states selected from an alphabet $\mathcal{S}$. The sets $\mathcal{X}$ and $\mathcal{Y}$ 
could be discrete or continuous, but $\mathcal{S}$ is always discrete 
and finite for a finite-state Markov channel (for example, 
$\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$). Let $\mathbf{X} \in \mathcal{X}^n$, $\mathbf{Y} \in \mathcal{Y}^n$, and 
$\mathbf{S} \in \mathcal{S}^{n+1}$ represent random vectors, consisting of channel inputs, 
channel outputs, and channel states, respectively. 

We assume throughout the paper that S forms a regular 
Markov chain operating in steady state, which is independent 
of the channel inputs X. Furthermore, given the channel state 
S, we assume that the channel is memoryless, i.e., 



$$ f(\mathbf{y} \mid \mathbf{s}, \mathbf{x}) = \prod_{t=1}^{n} f(y_t \mid s_t, x_t). \tag{1} $$



When both of these properties hold, then c is called a Markov 
channel. These properties exclude partial response channels 
from the discussion. Furthermore, the specification that the 
Markov chain is regular means that there exists a steady- 
state distribution for the state sequence S, and that the state 
probabilities converge to the steady-state distribution. 
Combining (1) with the PMF of $\mathbf{S}$, we can write 

$$ f(\mathbf{y}, \mathbf{s} \mid \mathbf{x}) = p(s_1) \prod_{t=1}^{n} f(y_t \mid s_t, x_t)\, p(s_{t+1} \mid s_t), \tag{2} $$

and the channel input-output relationship is given by marginalizing 
(2) over $\mathbf{s}$. 

From (2), the channel is fully parameterized by specifying 
$p(s_{t+1} \mid s_t)$ and $f(y_t \mid s_t, x_t)$. The values of $p(s_{t+1} \mid s_t)$ are 
commonly specified in a $|\mathcal{S}| \times |\mathcal{S}|$ matrix $\mathbf{P}$, known as the 
transition probability matrix. If $\mathcal{S} = \{1, 2, \ldots, |\mathcal{S}|\}$, then the 
element of $\mathbf{P}$ on the $i$th row and $j$th column is given by 

$$ P_{i,j} = p_{S_{t+1} \mid S_t}(j \mid i). $$



We assume that $f(y_t \mid s_t, x_t)$ is drawn from a given family of 
channels, where $s_t$ corresponds to a particular channel parameter 
for that family. For example, if $f(y_t \mid s_t, x_t)$ represents a 
binary symmetric channel (BSC), then each possible value of $s_t$ in $\mathcal{S}$ 
corresponds to an inversion probability. Thus, these parameters 
can be expressed in a vector $\boldsymbol{\eta}$, where 

$$ \boldsymbol{\eta} = [\eta_1, \eta_2, \ldots, \eta_{|\mathcal{S}|}]. $$

Given the family, a Markov channel $c$ is completely specified 
by the parameters 

$$ c = (\mathbf{P}, \boldsymbol{\eta}). $$
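
As an illustration only, the model of this section can be sketched in a few lines of Python for the BSC family. The function names `steady_state` and `simulate_markov_bsc`, the zero-based state labels, and the use of power iteration to find the steady-state distribution are assumptions made for this sketch; they are not taken from [1] or from the analysis in this paper.

```python
import random

def steady_state(P, iters=2000):
    """Approximate the steady-state distribution of a regular chain by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def simulate_markov_bsc(P, eta, x, rng):
    """Sample states s (length n+1) and outputs y (length n) for input bits x.

    State s_t selects the BSC inversion probability eta[s_t], following (1)-(2).
    States are indexed from 0 here, rather than from 1 as in the text.
    """
    k = len(P)
    s = [rng.choices(range(k), weights=steady_state(P))[0]]   # start in steady state
    y = []
    for x_t in x:
        flip = rng.random() < eta[s[-1]]                      # the BSC inverts the bit
        y.append(x_t ^ int(flip))
        s.append(rng.choices(range(k), weights=P[s[-1]])[0])  # next hidden state
    return s, y

# Two-state parameters chosen for illustration.
P = [[0.9, 0.1], [0.1, 0.9]]
eta = [0.1, 0.3]
rng = random.Random(0)
x = [rng.randrange(2) for _ in range(20)]                     # iid uniform input bits
s, y = simulate_markov_bsc(P, eta, x, rng)
```

The state sequence is drawn without reference to `x`, which mirrors the assumption that the Markov chain is independent of the channel inputs.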

III. Degrading Markov channels 

A. Mixing operator 

We re-use the Markov channel mixing operator from [3], 
[4], which is based on a proof technique from [5]. Let $c_1 = (\mathbf{P}_1, \boldsymbol{\eta}_1)$ 
and $c_2 = (\mathbf{P}_2, \boldsymbol{\eta}_2)$ represent Markov channels. The 
hidden Markov chain in each channel is implemented by a 
Markov state machine, $M_1$ and $M_2$ for channels $c_1$ and $c_2$, 
respectively, consisting of the possible states in each channel, 
connected by their transition probabilities. We assume that the 
set of states in $c_1$ and the set of states in $c_2$ are disjoint. 

We will "mix" these channels by allowing jumps between 
their respective Markov state machines, as follows. Let 
$\mathbf{U}^{(1 \to 2)}$ and $\mathbf{U}^{(2 \to 1)}$ represent Bernoulli random vectors of the 
same length as the state sequences, whose elements take values 
in $\{0, 1\}$. The vectors $\mathbf{U}^{(1 \to 2)}$ and $\mathbf{U}^{(2 \to 1)}$ are independent of 
the channel inputs $\mathbf{X}$. Then the "mixed" state machine behaves 
as follows: 

• If the state at time $t$ is in machine $M_1$, and $U_t^{(1 \to 2)} = 1$, 
then the state at time $t + 1$ is in $M_2$, chosen at random 
according to the steady-state probabilities of the states in 
$M_2$, and independently of any previous state. 

• If the state at time $t$ is in machine $M_2$, and $U_t^{(2 \to 1)} = 1$, 
then the state at time $t + 1$ is in $M_1$, chosen at random 
according to the steady-state probabilities of the states in 
$M_1$, and independently of any previous state. 

• If neither of these conditions holds, then the next state is 
chosen randomly according to the Markov chain probabilities 
in either $M_1$ (if the current state is in $M_1$) or 
$M_2$ (if the current state is in $M_2$). 

Let $\mu_{12}$ and $\mu_{21}$ represent the probabilities $\Pr(U_t^{(1 \to 2)} = 1)$ 
and $\Pr(U_t^{(2 \to 1)} = 1)$, respectively. If $\mathbf{U}^{(1 \to 2)}$ and $\mathbf{U}^{(2 \to 1)}$ are 
not observed, it is straightforward to show that the resulting 
"mixed" channel has a transition probability matrix given by 

$$ \mathbf{P}' = \begin{bmatrix} (1 - \mu_{12}) \mathbf{P}_1 & \mu_{12} \bar{\mathbf{P}}_2 \\ \mu_{21} \bar{\mathbf{P}}_1 & (1 - \mu_{21}) \mathbf{P}_2 \end{bmatrix}, $$

where $\bar{\mathbf{P}}_1$ is a matrix with the same number of rows as $\mathbf{P}_2$ 
and the same number of columns as $\mathbf{P}_1$, where each row 
corresponds to the steady-state probabilities of the states in 
$c_1$; similarly, $\bar{\mathbf{P}}_2$ is a matrix with the same number of rows as 
$\mathbf{P}_1$ and the same number of columns as $\mathbf{P}_2$, where each row 
corresponds to the steady-state probabilities of the states in 
$c_2$. Furthermore, since the mixed state machine contains the 
union of the states from the original state machines, which 
were disjoint, the new vector of channel behaviors is given by 

$$ \boldsymbol{\eta}' = [\boldsymbol{\eta}_1, \boldsymbol{\eta}_2]. $$

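We use the operator $\leftrightarrow$ to represent this mixing operation. 
If the channel $c' = (\mathbf{P}', \boldsymbol{\eta}')$ is formed in this manner from $c_1$ 
and $c_2$, with parameters $\mu_{12}$ and $\mu_{21}$, we write 

$$ c' = c_1 \underset{\mu_{12},\, \mu_{21}}{\longleftrightarrow} c_2. $$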
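
The block structure of the mixed transition matrix can be sketched as follows. This is a minimal illustration, assuming channels are stored as plain Python lists; the helper names `steady_state` and `mix` are hypothetical and do not appear in [3], [4]. The numerical values are the Gilbert-Elliott parameters used in Example 1.

```python
def steady_state(P, iters=2000):
    """Approximate the steady-state distribution of a regular chain by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def mix(P1, eta1, P2, eta2, mu12, mu21):
    """Build the parameters of the mixed channel when the jump variables are unobserved."""
    pi1, pi2 = steady_state(P1), steady_state(P2)
    # Rows for states of machine 1: stay in machine 1 w.p. (1 - mu12),
    # otherwise land in machine 2 according to its steady-state distribution.
    top = [[(1 - mu12) * p for p in row] + [mu12 * q for q in pi2] for row in P1]
    # Rows for states of machine 2, by symmetry.
    bottom = [[mu21 * q for q in pi1] + [(1 - mu21) * p for p in row] for row in P2]
    return top + bottom, list(eta1) + list(eta2)

P = [[0.9, 0.1], [0.1, 0.9]]
P_mixed, eta_mixed = mix(P, [0.1, 0.3], P, [0.18, 0.34], 0.1, 0.1)
```

Each row of `P_mixed` sums to one for any valid inputs, since the two blocks in a row carry total weights $1 - \mu$ and $\mu$.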



We give an example to illustrate the use of the operator, as 
follows. 

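Example 1: The Gilbert-Elliott channel [5] is a two-state 
Markov channel, where each state corresponds to a BSC with 
a different inversion probability. Let $c$ be a Gilbert-Elliott 
channel with parameters 

$$ (\mathbf{P}, \boldsymbol{\eta}) = \left( \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix},\ [0.1, 0.3] \right). $$

Also, let $c^*$ be another Gilbert-Elliott channel with parameters 

$$ (\mathbf{P}^*, \boldsymbol{\eta}^*) = \left( \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix},\ [0.18, 0.34] \right). $$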



In both cases, it is easy to show that the steady-state probabilities 
of each state are given by $\bar{p}_1 = \bar{p}_2 = 0.5$. Let 
$\mu_{12} = \mu_{21} = 0.1$. In this case, if 
$c' = c \underset{\mu_{12},\, \mu_{21}}{\longleftrightarrow} c^*$, then 

$$ c' = \left( \begin{bmatrix} 0.81 & 0.09 & 0.05 & 0.05 \\ 0.09 & 0.81 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.81 & 0.09 \\ 0.05 & 0.05 & 0.09 & 0.81 \end{bmatrix},\ [0.1, 0.3, 0.18, 0.34] \right). $$

(End of example.) 

Notice that, if $\mathbf{U}^{(1 \to 2)}$ and $\mathbf{U}^{(2 \to 1)}$ are observed, then the 
Markov chain is divided into independent piecewise-Markov 
segments, with the divisions occurring at each transition between 
the two state machines. This occurs because the new 
state is chosen with respect to the steady-state probabilities 
within the new state machine, independently of any previous 
state. Thus, if the input distribution $f(\mathbf{x})$ is iid, the channel 
outputs $\mathbf{y}$ are also split into independent piecewise-hidden-Markov 
segments. 
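
The segmentation induced by observing the jump variables can be sketched with a short simulation. The function name `simulate_mixed` and the choice to start in the first machine are assumptions made for this sketch; the paper instead assumes the chain operates in steady state. The code only illustrates how each jump starts a fresh segment drawn from the steady-state distribution of the new machine.

```python
import random

def steady_state(P, iters=2000):
    """Approximate the steady-state distribution of a regular chain by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def simulate_mixed(P1, P2, mu12, mu21, n, rng):
    """Return a list of segments (machine_index, states) covering n time steps."""
    Ps = (P1, P2)
    pis = (steady_state(P1), steady_state(P2))
    mus = (mu12, mu21)
    m = 0  # assumption: start in machine M1 (index 0), for simplicity
    state = rng.choices(range(len(P1)), weights=pis[0])[0]
    segments = [(m, [state])]
    for _ in range(n - 1):
        if rng.random() < mus[m]:
            # Jump variable equals 1: move to the other machine and draw the
            # new state from its steady-state distribution (a new segment).
            m = 1 - m
            state = rng.choices(range(len(Ps[m])), weights=pis[m])[0]
            segments.append((m, [state]))
        else:
            # Jump variable equals 0: ordinary Markov step inside machine m.
            state = rng.choices(range(len(Ps[m])), weights=Ps[m][state])[0]
            segments[-1][1].append(state)
    return segments

P = [[0.9, 0.1], [0.1, 0.9]]
segments = simulate_mixed(P, P, 0.1, 0.1, 50, random.Random(1))
```

Because every new segment is started without reference to the previous state, the segments are mutually independent once the jump times are known.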

B. Broken-chain degraded families 

We can form a family of degraded channels based on 
operations similar to $\leftrightarrow$. Let $\mathcal{D}_c$ represent a family of broken-chain 
degraded channels, degraded with respect to $c$, defined 
as follows. For all $c^* \in \mathcal{D}_c$, there exists a (vector) random 
variable $\mathbf{U}$ with the following properties: 

• if $\mathbf{U}$ is unknown, then the channel is a Markov channel 
with parameters $c^*$; 

• if $\mathbf{U}$ is known, then the channel is a piecewise-Markov 
channel, where each segment has parameters $c$; and 

• $\mathbf{U}$ is always independent of the channel inputs $\mathbf{X}$. 

Furthermore, it is easy to see that the definition of $\mathcal{D}_c$ is 
intended to be used with the operator $\leftrightarrow$, since that operator 
generates channels which are piecewise-Markov (although the 
parameters on those segments might be different). 

We make a few remarks on this definition. Firstly, it is quite 
easy to see that $c \in \mathcal{D}_c$, since $\mathbf{U}$ can be empty. Secondly, if 
$c^* \in \mathcal{D}_c$, and we form $\mathcal{D}_{c^*}$, which is the degraded family 
of $c^*$, then $\mathcal{D}_{c^*} \subseteq \mathcal{D}_c$, since the random vectors $\mathbf{U}$ can be 
concatenated. Thirdly, the random variable $\mathbf{U}$ need not necessarily 
break the Markov chain: if the Markov chain remains 
in one piece and remains Markov, it is trivially piecewise-Markov. 
For instance, if $c$ is a Gilbert-Elliott channel, and 
$c^*$ is a channel formed by concatenating a channel having 
parameters $c$ with an independent BSC, then $\mathbf{U}$ could be the 
independent BSC's noise sequence, which restores the original 
channel. 
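
For the BSC family, the effect of concatenating each state with an independent BSC is easy to compute: a bit is inverted overall when exactly one of the two channels inverts it. The sketch below uses a hypothetical helper `cascade_bsc` and applies it to the inversion probabilities of the channel $c$ in Example 1, with an extra inversion probability of 0.1 assumed for the independent BSC.

```python
def cascade_bsc(p, q):
    """Inversion probability of a BSC(p) followed by an independent BSC(q)."""
    # The overall bit is inverted when exactly one of the two stages inverts it.
    return p * (1 - q) + (1 - p) * q

eta = [0.1, 0.3]                                  # inversion probabilities of c
eta_star = [cascade_bsc(p, 0.1) for p in eta]     # after an extra BSC(0.1)
```

The resulting values agree with the inversion probabilities $[0.18, 0.34]$ of the channel $c^*$ in Example 1, up to floating-point rounding.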

IV. Main result 

A. Definitions and notation 

We briefly describe some important definitions and notation 
in this section. Recall that we are restricting ourselves to 
regular Markov chains (which is implicit in the term Markov 
channel), and iid input distributions. Firstly, since Markov 
channels have memory, the mutual information is defined as 

$$ I(X; Y) = \lim_{n \to \infty} \frac{1}{n} I(\mathbf{X}; \mathbf{Y}), $$

where $I(\mathbf{X}; \mathbf{Y})$ represents the mutual information between the 
length-$n$ vector random variables $\mathbf{X}$ and $\mathbf{Y}$, and noting that 
the limit exists thanks to the restrictions we have imposed. 
Since we consider mutual information under various channel 
assumptions, we will write 

$$ I[c](X; Y) $$

to represent the mutual information in channel $c$. Similarly, 
for the vector version, we will write $I[c](\mathbf{X}; \mathbf{Y})$. A segment 
of one of these vectors, for example from the $i$th element to 
the $j$th element, $j > i$, is written 

$$ \mathbf{X}_i^j = [X_i, X_{i+1}, \ldots, X_j]. $$

We will write $I[c](\mathbf{X}_i^j; \mathbf{Y}_i^j)$ to represent the mutual information 
between these vector segments. 

B. Result 

The main result of this paper is stated in the following 
theorem. 

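Theorem 1: Let $c$ represent a Markov channel, and let $\mathcal{D}_c$ 
represent its degraded family. Suppose the input distribution 
is iid. If $c^* \in \mathcal{D}_c$, then 

$$ I[c](X; Y) \ge I\!\left[ c \underset{\mu_{12},\, \mu_{21}}{\longleftrightarrow} c^* \right](X; Y) \tag{3} $$

for all $0 \le \mu_{12} \le 1$, $0 \le \mu_{21} \le 1$. 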

To prove the Theorem, we first require the following useful 
lemma: 

Lemma 1: Let $c$ represent a Markov channel. Then, if the 
input distribution is iid, 

$$ I[c](X; Y) \ge \frac{1}{k} I[c](\mathbf{X}_1^k; \mathbf{Y}_1^k) $$

for any $k < \infty$. 

The lemma states that observing a truncated Markov channel 
never gives more mutual information per channel use than a 
Markov channel observed over an asymptotically long period of 
time. The proof of the lemma is contained in Appendix A. 
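
The right-hand side of Lemma 1 can be evaluated exactly for small $k$ in a special case. The sketch below assumes the BSC family with iid uniform binary inputs, so that $\mathbf{Y} = \mathbf{X} \oplus \mathbf{Z}$ for a noise sequence $\mathbf{Z}$ independent of $\mathbf{X}$, and hence $I(\mathbf{X}_1^k; \mathbf{Y}_1^k) = k - H(\mathbf{Z}_1^k)$ bits. The function names and the brute-force enumeration over all $2^k$ noise patterns are choices made for this illustration, not part of the proof.

```python
from itertools import product
from math import log2

def steady_state(P, iters=2000):
    """Approximate the steady-state distribution of a regular chain by power iteration."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

def block_information_rate(P, eta, k):
    """(1/k) I(X_1^k; Y_1^k) in bits, for BSC states and iid uniform inputs."""
    pi = steady_state(P)
    S = range(len(P))
    H = 0.0
    for z in product((0, 1), repeat=k):            # every noise pattern of length k
        alpha = list(pi)                           # forward recursion, steady-state start
        for t, z_t in enumerate(z):
            alpha = [a * (eta[s] if z_t else 1 - eta[s]) for s, a in zip(S, alpha)]
            if t < k - 1:
                alpha = [sum(alpha[i] * P[i][j] for i in S) for j in S]
        pz = sum(alpha)                            # probability of this noise pattern
        if pz > 0:
            H -= pz * log2(pz)
    return 1.0 - H / k

P = [[0.9, 0.1], [0.1, 0.9]]
eta = [0.1, 0.3]
rates = [block_information_rate(P, eta, k) for k in range(1, 9)]
```

For a stationary noise process the per-symbol block entropy does not increase with $k$, so the values in `rates` should be non-decreasing, each being a lower bound on $I[c](X; Y)$ in the sense of Lemma 1.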

We also require the following lemma, which relates the 
degraded family $\mathcal{D}_c$ to the operator $\leftrightarrow$: 

Lemma 2: If $c^* \in \mathcal{D}_c$, then 

$$ c \leftrightarrow c^* \in \mathcal{D}_c. $$

Proof: For channel $c \leftrightarrow c^*$, knowledge of $\mathbf{U}^{(1 \to 2)}$ and 
$\mathbf{U}^{(2 \to 1)}$ breaks the channel into piecewise segments of $c$ and 
$c^*$. However, since $c^* \in \mathcal{D}_c$, there exists $\mathbf{U}$ to transform each 
piecewise segment in $c^*$ to a segment in $c$. Taken together, 
$\mathbf{U}^{(1 \to 2)}$, $\mathbf{U}^{(2 \to 1)}$, and $\mathbf{U}$ transform $c \leftrightarrow c^*$ into piecewise 
segments of $c$, which is the definition of a channel in $\mathcal{D}_c$. ■ 

The proof of Theorem 1 is then given as follows. 



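Proof: Let $c' = c \underset{\mu_{12},\, \mu_{21}}{\longleftrightarrow} c^*$. By Lemma 2, $c' \in \mathcal{D}_c$, 
so there exists a random variable $\mathbf{U}$ which transforms $c'$ into 
piecewise-Markov segments of $c$. 

Let $\mathcal{J}$ represent an index set corresponding to the independent 
segments, let the subscript $i,j$ represent the $i$th symbol 
in the $j$th segment, and let $\ell(j)$ represent the length of the $j$th 
segment. Writing $\mathbf{x}^{(j)} = [x_{1,j}, \ldots, x_{\ell(j),j}]$ and 
$\mathbf{y}^{(j)} = [y_{1,j}, \ldots, y_{\ell(j),j}]$ for the $j$th segments, we have that 

$$ f(\mathbf{y} \mid \mathbf{x}, \mathbf{u}) = \prod_{j \in \mathcal{J}} f(\mathbf{y}^{(j)} \mid \mathbf{x}^{(j)}). $$

Since $\mathbf{X}$ is iid (by assumption), then 

$$ f(\mathbf{y} \mid \mathbf{u}) = \prod_{j \in \mathcal{J}} f(\mathbf{y}^{(j)}), $$

which is accomplished by marginalizing over each $x_{i,j}$. Thus, 
since both $f(\mathbf{y} \mid \mathbf{x}, \mathbf{u})$ and $f(\mathbf{y} \mid \mathbf{u})$ are partitioned into 
independent segments, we can write 

$$ I[c'](\mathbf{X}; \mathbf{Y} \mid \mathbf{U}) = E\left[ \sum_{j \in \mathcal{J}} I[c'](\mathbf{X}^{(j)}; \mathbf{Y}^{(j)}) \right], \tag{4} $$

where the expectation is taken over $\mathcal{J}$ and $\ell(j)$, which are 
functions of the random variables $\mathbf{U}$. 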

Since U is independent of X, it is true that 

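$$ \begin{aligned} I[c'](\mathbf{X}; \mathbf{Y}) &\le I[c'](\mathbf{X}; \mathbf{Y}, \mathbf{U}) \\ &= I[c'](\mathbf{X}; \mathbf{Y} \mid \mathbf{U}) + I[c'](\mathbf{X}; \mathbf{U}) \\ &= I[c'](\mathbf{X}; \mathbf{Y} \mid \mathbf{U}). \end{aligned} \tag{5} $$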

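Because the distribution of $\mathbf{X}$ is iid, and each segment is a 
Markov channel with parameters $c$, we can rewrite (4) as 

$$ I[c'](\mathbf{X}; \mathbf{Y} \mid \mathbf{U}) = E\left[ \sum_{j \in \mathcal{J}} I[c](\mathbf{X}_1^{\ell(j)}; \mathbf{Y}_1^{\ell(j)}) \right]. $$

From Lemma 1, we have that 

$$ E\left[ \sum_{j \in \mathcal{J}} I[c](\mathbf{X}_1^{\ell(j)}; \mathbf{Y}_1^{\ell(j)}) \right] \le E\left[ \sum_{j \in \mathcal{J}} \ell(j)\, I[c](X; Y) \right] = n I[c](X; Y), \tag{6} $$

where the inequality follows from the fact that each term 
under the expectation on the left is no greater than the 
corresponding term under the expectation on the right, and the 
equality follows from the fact that the sum of the lengths $\ell(j)$ 
of all the segments equals the length $n$ of the sequence, 
regardless of how the sequence is divided. 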



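From (5), (6), and the definition of $I[c'](X; Y)$, we have 
that 

$$ \begin{aligned} I[c'](X; Y) &= \lim_{n \to \infty} \frac{1}{n} I[c'](\mathbf{X}; \mathbf{Y}) \\ &\le \lim_{n \to \infty} \frac{1}{n} I[c'](\mathbf{X}; \mathbf{Y} \mid \mathbf{U}) \\ &\le I[c](X; Y), \end{aligned} $$

which proves the theorem. ■ 

Notice, from Lemma 2, that the channel $c \leftrightarrow c^*$ goes 
back into the degraded family $\mathcal{D}_c$. Thus, the ordering given in 
Theorem 1 can be applied recursively to create an ordering of 
arbitrary size. 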

C. Discussion 

To illustrate the use of Theorem 1, we can expand Example 
1. As we mentioned in Section III-B, if $c$ is a Gilbert-Elliott 
channel, then one member of $\mathcal{D}_c$ is $c$ concatenated with an 
independent BSC, where $\mathbf{U}$ is the noise sequence of the BSC. 
Thus, we have the following: 

Example 2: From Example 1, it is straightforward to show 
that $c^*$ is formed by concatenating $c$ with a BSC with inversion 
probability $p = 0.1$. Thus, $c^* \in \mathcal{D}_c$, and applying Theorem 1, 
it is true that $I[c](X; Y) \ge I[c'](X; Y)$. 

Furthermore, notice that the ordering can now be applied 
recursively: by combining $c'$ with $c$ (and optionally concatenating 
$c'$ with a BSC), we obtain a channel with six states, 
which is degraded with respect to $c$; continuing the process, we 
can obtain degraded channels with eight states, ten states, and 
so on, each time adding $\mu_{12}$ and $\mu_{21}$ as degrees of freedom. 

(End of example.) 
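
The ordering in Example 2 can be cross-checked numerically with two elementary bounds, assuming iid uniform binary inputs. Neither bound is stated in this form in the paper, and the function names are hypothetical. The lower bound on $I[c](X; Y)$ is Lemma 1 with $k = 1$. The upper bound on $I[c'](X; Y)$ reveals the state sequence to the receiver, which cannot decrease the mutual information because the states are independent of the inputs (the same argument as in (5)).

```python
from math import log2

def h2(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def lower_bound_single_use(pi, eta):
    """Lemma 1 with k = 1: one channel use sees a BSC with the average inversion probability."""
    return 1 - h2(sum(w * e for w, e in zip(pi, eta)))

def upper_bound_state_known(pi, eta):
    """Mutual information rate when the receiver is also told the state sequence."""
    return 1 - sum(w * h2(e) for w, e in zip(pi, eta))

# Channel c of Example 1: steady-state probabilities 0.5 each.
lb_c = lower_bound_single_use([0.5, 0.5], [0.1, 0.3])

# Mixed channel c' of Example 1: its transition matrix is doubly stochastic,
# so the steady-state distribution is uniform over the four states.
ub_c_prime = upper_bound_state_known([0.25] * 4, [0.1, 0.3, 0.18, 0.34])
```

In this instance `ub_c_prime` falls below `lb_c`, so the two bounds alone already separate the channels, in agreement with Theorem 1. For other parameter choices the bounds may overlap, in which case they are inconclusive and the theorem itself is needed.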

The assumption that the input density p(x) is iid is critical 
to our analysis. Unfortunately, as noted in [1], it is frequently 
difficult to prove capacity results for Markov channels with 
non-iid inputs (although general capacity results were given 
in [6], using Lyapunov exponents). We leave to future work 
the open problem of extending our ordering to channels 
with general inputs. 

V. Acknowledgments 

The author wishes to acknowledge a stimulating discussion 
with Prof. Frank R. Kschischang, of the University of Toronto, 
that led him to pursue this problem. 

Appendix 

A. Proof of Lemma 1 

We give the proof for discrete-valued random variables 
X, Y. It is straightforward to generalize the proof to the case 
of continuous-valued random variables, and we describe how 
to do so at the end. 

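For convenience, suppose that there exists an integer $h$ such 
that $hk = n$. Let the vector $\mathbf{x}$ be broken up into segments of 
length $k$, so that 

$$ \mathbf{x} = [\mathbf{x}_1^k, \mathbf{x}_{k+1}^{2k}, \ldots, \mathbf{x}_{(h-1)k+1}^{hk}] = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(h)}], $$

where we use $\mathbf{x}^{(i)}$ to represent $\mathbf{x}_{(i-1)k+1}^{ik}$. Similarly, the vector 
$\mathbf{y}$ is represented by 

$$ \mathbf{y} = [\mathbf{y}^{(1)}, \mathbf{y}^{(2)}, \ldots, \mathbf{y}^{(h)}]. $$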

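Suppose that, after transmitting $\mathbf{x}^{(i)}$, the transmitter waits 
(and does not transmit) for $(d-1)k$ channel uses, for some 
very large integer $d$, before transmitting $\mathbf{x}^{(i+1)}$. Since (by 
assumption) the Markov chain is regular, the transition probability 
matrix between the states underlying the received values 
$y_k^{(i)}$ and $y_1^{(i+1)}$ is (almost) given by the matrix $\bar{\mathbf{P}}$, each row of 
which contains the steady-state probabilities $\bar{p}(s)$. 

In fact, let $\delta$ represent the maximum deviation from $\bar{p}(s_1^{(i+1)})$, 
so that 

$$ \bar{p}(s_1^{(i+1)})(1 - \delta) \le p(s_1^{(i+1)} \mid s_k^{(i)}) \le \bar{p}(s_1^{(i+1)})(1 + \delta) \tag{7} $$

for all $s_1^{(i+1)}$ and $s_k^{(i)}$. If the Markov chain is regular, it is well 
known that $\delta \to 0$ as $d \to \infty$. 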
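
The decay of the deviation can be sketched numerically. The code below is an illustration under stated assumptions: it measures the largest relative gap between the $m$-step transition probabilities and the steady-state probabilities, with the number of steps $m$ standing in for the waiting period, and it uses the two-state transition matrix of Example 1. The helper names are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(P, m):
    """m-step transition matrix, computed by repeated multiplication."""
    n = len(P)
    R = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(m):
        R = mat_mul(R, P)
    return R

def max_relative_deviation(P, pi, m):
    """Largest relative gap between m-step transition and steady-state probabilities."""
    Pm = mat_pow(P, m)
    n = len(P)
    return max(abs(Pm[i][j] - pi[j]) / pi[j] for i in range(n) for j in range(n))

P = [[0.9, 0.1], [0.1, 0.9]]
pi = [0.5, 0.5]                     # steady-state distribution of P
deltas = [max_relative_deviation(P, pi, m) for m in (1, 5, 10, 20)]
```

For this matrix the second eigenvalue is 0.8, so the deviation after $m$ steps equals $0.8^m$ and vanishes geometrically as the waiting period grows.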
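Then we have that 

$$ f(\mathbf{y} \mid \mathbf{x}) \le (1 + \delta)^h \prod_{i=1}^{h} f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \tag{8} $$

and 

$$ f(\mathbf{y} \mid \mathbf{x}) \ge (1 - \delta)^h \prod_{i=1}^{h} f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \tag{9} $$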



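Let $\epsilon^+ = (1 + \delta)^h$, and let $\epsilon^- = (1 - \delta)^h$. Calculating $H(\mathbf{Y} \mid \mathbf{X})$, 
we have that 

$$ \begin{aligned} H(\mathbf{Y} \mid \mathbf{X}) &= -\int_{\mathbf{x}} \int_{\mathbf{y}} f(\mathbf{x}, \mathbf{y}) \log f(\mathbf{y} \mid \mathbf{x}) \\ &= -\int_{\mathbf{x}} p(\mathbf{x}) \int_{\mathbf{y}} f(\mathbf{y} \mid \mathbf{x}) \log f(\mathbf{y} \mid \mathbf{x}). \end{aligned} \tag{10} $$

However, from the bounds above, we can write 

$$ \begin{aligned} -\int_{\mathbf{y}} f(\mathbf{y} \mid \mathbf{x}) \log f(\mathbf{y} \mid \mathbf{x}) &\le -\epsilon^+ \int_{\mathbf{y}} \prod_{i=1}^{h} f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \log \left( \epsilon^- \prod_{i=1}^{h} f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \right) \\ &= -\epsilon^+ \sum_{i=1}^{h} \int_{\mathbf{y}^{(i)}} f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \log f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) - \epsilon^+ \log \epsilon^-, \end{aligned} \tag{11} $$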



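where the inequality follows from the fact that $\log f(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$ 
is never positive for discrete-valued random variables. Combining 
(10) and (11), and recalling that $p(\mathbf{x})$ is iid, we have that 

$$ \begin{aligned} H(\mathbf{Y} \mid \mathbf{X}) &\le \epsilon^+ \sum_{i=1}^{h} H(\mathbf{Y}^{(i)} \mid \mathbf{X}^{(i)}) - \epsilon^+ \log \epsilon^- \\ &= \epsilon^+ h H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) - \epsilon^+ \log \epsilon^-. \end{aligned} $$

Similarly, it can be shown that 

$$ H(\mathbf{Y} \mid \mathbf{X}) \ge \epsilon^- h H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) - \epsilon^- \log \epsilon^+. $$

Furthermore, since $\mathbf{X}$ is iid, then marginalizing (8) and (9) 
with respect to $\mathbf{x}$, and following the same derivation, results 
in 

$$ H(\mathbf{Y}) \le \epsilon^+ h H(\mathbf{Y}_1^k) - \epsilon^+ \log \epsilon^- $$

and 

$$ H(\mathbf{Y}) \ge \epsilon^- h H(\mathbf{Y}_1^k) - \epsilon^- \log \epsilon^+. $$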



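Since $\delta \to 0$ as $d \to \infty$, the upper and lower bounds coincide 
in the limit. Thus, $\lim_{d \to \infty} H(\mathbf{Y} \mid \mathbf{X}) = h H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k)$ and 
$\lim_{d \to \infty} H(\mathbf{Y}) = h H(\mathbf{Y}_1^k)$, so 

$$ \lim_{d \to \infty} I(\mathbf{X}; \mathbf{Y}) = h \left( H(\mathbf{Y}_1^k) - H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) \right). $$

Thus, the average information rate of this channel is given by 

$$ \lim_{d \to \infty} \frac{1}{hdk} I(\mathbf{X}; \mathbf{Y}) = \frac{1}{dk} \left( H(\mathbf{Y}_1^k) - H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) \right), $$

where $hdk$ is the total number of channel uses. 

Notice that there are $(d-1)k$ channel uses left unused for 
every $k$ that are used. We can fill these using the same method, 
transmitting for $k$ channel uses and waiting (and ignoring the 
channel) for $(d-1)k$ channel uses. In this case, the total 
information rate improves by a factor of $d$, to 

$$ \lim_{d \to \infty} \frac{1}{hdk} I(\mathbf{X}; \mathbf{Y}) = \frac{1}{k} \left( H(\mathbf{Y}_1^k) - H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) \right). $$

However, it is obvious that the channel capacity in this case 
is given by $I(X; Y)$. Thus, by the data processing inequality, 

$$ I(X; Y) \ge \frac{1}{k} \left( H(\mathbf{Y}_1^k) - H(\mathbf{Y}_1^k \mid \mathbf{X}_1^k) \right), $$

and the lemma follows. 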

To generalize the lemma to continuous-valued $\mathbf{X}$, $\mathbf{Y}$, it is 
necessary to take into account the fact that $\log f(\mathbf{y} \mid \mathbf{x})$ could 
be positive. In this case, the inequality leading up to (11) is 
broken up into integrals over the regions in which $\log f(\mathbf{y} \mid \mathbf{x})$ is 
positive and negative, and the appropriate bound is used over both regions; 
the result is a mixture of the two bounds given above, so the 
convergence result holds. 